Abstract

We describe a system that simplifies the process of debugging programs produced by computer-aided parallelization
tools. The system uses relative debugging techniques to compare serial and parallel executions in order to show where the
computations begin to differ. If the original serial code is correct, errors due to parallelization will be isolated by the
comparison.

One of the primary goals of the system is to minimize the effort required of the user. To that end, the debugging system
uses information produced by the parallelization tool to drive the comparison process. In particular, the debugging system
relies on the parallelization tool to provide information about where variables may have been modified and how arrays are
distributed across multiple processes. User effort is also reduced through the use of dynamic instrumentation. This allows us
to modify the program execution without changing the way the user builds the executable. The use of dynamic instrumentation
also permits us to compare the executions in a fine-grained fashion and only involve the debugger when a difference has
been detected. This reduces the overhead of executing instrumentation.

1. Background

One of the problems facing scientific programmers on high-end computers is that as performance requirements drive up 
the complexity of machines, they also drive up the complexity of programming models used on them. As a consequence, 
debugging such codes becomes more difficult. In this paper we describe how automated debugging support can alleviate 
some of those problems. We begin by providing some background on the target machines and programming model that we 
are addressing. 

1.1 Programming Distributed Memory Computers 

A common approach for delivering high performance in computers today is to use a distributed memory architecture. 
Such a computer consists of a number of processors connected together in a network. Each processor has its local memory 
that it can access directly. Data from other processors must be accessed via the network. 

In this paper we consider the SPMD (Single Program/Multiple Data) programming paradigm, where each processor executes
the same program on a subset of the total data. Using this paradigm, computations being performed by one process
will often require data calculated on another process, and data has to be moved between the processes. This data movement
is typically performed by explicit message passing from one processor to another using a message passing library like MPI
[17] or PVM [20]. The development of a parallel program based on message passing adds a new level of complexity to the
software engineering process since not only the computation, but also the explicit movement of data between processes must
be specified.

Given the enormous investment made in existing scientific applications, there is a strong incentive to produce parallel 
versions through a conversion process rather than re-implementing from scratch. 

1.2 Converting Serial Codes to Message Passing Parallel Codes 

When converting a sequential program into parallel code, one way to achieve parallelism is to partition the elements of an
array among the processors and have each processor update only the array elements that are assigned to it. A straightforward
way to convert a serial loop into a parallel loop based on message passing is to distribute the loop iterations among the
processors. The array is logically partitioned into blocks, and each processor is assigned one or more of the blocks*. The
processor is then responsible for updating the array elements assigned to it. For example, the Fortran loop

      do i = 1, n
         a(i) = b(i) + 2
      end do

could be parallelized by splitting up arrays a and b into contiguous sections. Each processor would execute: 

      do i = lower, upper
         a(i) = b(i) + 2
      end do

where lower and upper denote the lower and upper index of the array section assigned to the processor. Now consider the 
loop: 

      do i = 1, n
         a(i) = b(i-1) + 2
      end do

If array b is partitioned the same way as array a, processor p will have to access data from processor p-1. Therefore calls to
communication routines have to be inserted. Processor p has to send b(upper) to processor p+1 and receive b(lower-1)
from processor p-1:

      call send(b(upper), 1, real, p+1, ierr)
      call receive(b(lower-1), 1, real, p-1, ierr)
      do i = lower, upper
         a(i) = b(i-1) + 2
      end do
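
The role of the inserted communication can be illustrated with a small simulation. The following Python sketch is an illustration under stated assumptions, not generated tool output: the processes are simulated one after another instead of exchanging real messages, the number of processes is assumed to divide n, and the names owned_range and parallel are hypothetical. It shows that the partitioned loop reproduces the serial result when b(lower-1) is supplied by the neighbor, and that omitting the exchange corrupts exactly the first element of each block.

```python
def serial(b, n):
    """Serial loop: a(i) = b(i-1) + 2 for i = 1..n, with b indexed 0..n."""
    return [b[i - 1] + 2 for i in range(1, n + 1)]


def owned_range(n, nprocs, p):
    """1-based inclusive (lower, upper) of the block owned by process p (0-based).

    Assumes nprocs divides n evenly.
    """
    size = n // nprocs
    return p * size + 1, (p + 1) * size


def parallel(b, n, nprocs, exchange=True):
    """Simulate the block-partitioned loop, one process after another."""
    a = []
    for p in range(nprocs):
        lower, upper = owned_range(n, nprocs, p)
        # Each process holds only its own section b(lower:upper).
        local_b = {i: b[i] for i in range(lower, upper + 1)}
        # Halo element b(lower-1): the value received from the left neighbor
        # (the boundary value b(0) on the first process), or a stale 0.0 if
        # the communication call is missing.
        local_b[lower - 1] = b[lower - 1] if exchange else 0.0
        a.extend(local_b[i - 1] + 2 for i in range(lower, upper + 1))
    return a


n, nprocs = 8, 4
b = [float(i) for i in range(n + 1)]
reference = serial(b, n)
with_exchange = parallel(b, n, nprocs)
without_exchange = parallel(b, n, nprocs, exchange=False)
# 1-based indices where the version without communication differs
stale = [i + 1 for i, (x, y) in enumerate(zip(reference, without_exchange)) if x != y]
```

In this setup the list stale contains the lower bounds of every block except the first, which is where a processor would read a value it never received.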

The loop: 

      do i = 1, n
         a(i) = (a(i) + a(i-1)) * 0.5
      end do

cannot be executed in parallel since data from iteration i is dependent on data from iteration i-1.

In order to determine whether a loop can be parallelized and which updates require data from another process, the array
indices have to be analyzed in order to detect dependences. The analysis has to be done between individual statements,
iterations of a loop, and subroutine calls. The technique of dependence analysis is well understood [21] and has been
implemented in compilers for code optimization.


*Besides the block partitioning used in the example, cyclic or block-cyclic distribution of array elements is common.
Discovering dependences manually and then inserting the necessary message passing calls is a tedious and error-prone
task. There are several systems that assist the user in the task of parallelizing codes. For example, the CAPTools system from
the University of Greenwich [7] can take a serial program in Fortran 77 and, with some user guidance, turn it into a message
passing parallel program. The user's role in this process is fairly modest. While the tool is analyzing the serial code, it may
ask the user for additional information in order to perform a more precise dependence analysis. After the analysis is done,
the user chooses a distribution for one or more of the arrays. Then, when CAPTools is producing the parallel version, it may
ask additional questions about the relative values of variables, such as "Can N be larger than M?" The result of this process is
a message passing version of the code.

1.3 Errors in Automatically Parallelized Codes

There are several reasons why the automatically generated version might produce results that differ from the serial original 
version: 

1. Parallelization may change the order of execution in some loops and lead to numerical discrepancies. For example,
performing a sum reduction in a different order could produce different results because of the non-associativity of
floating point addition.

2. The serial code may have errors. For example, if the serial code references an undefined variable, execution of the
parallel code may produce a different result by reading a different value for the undefined variable.

3. The tool for automatic parallelization may be buggy.

4. The user can introduce errors by providing incorrect inputs to the tool, e.g., incorrect responses to the system’s queries
or incorrect removal of dependences.
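
Reason 1 can be reproduced in a few lines. The Python sketch below is an illustration only: the input values are chosen by us to make the effect visible in IEEE double precision, and blocked_sum is a hypothetical stand-in for a parallel sum reduction in which each process adds its own block before the partial sums are combined.

```python
def serial_sum(values):
    """Left-to-right summation, as in the serial loop."""
    total = 0.0
    for v in values:
        total += v
    return total


def blocked_sum(values, nprocs):
    """Sum each contiguous block separately, then add the partial sums.

    Assumes nprocs divides len(values) evenly.
    """
    size = len(values) // nprocs
    partials = [serial_sum(values[p * size:(p + 1) * size]) for p in range(nprocs)]
    return serial_sum(partials)


values = [1.0e16, 1.0, -1.0e16, 1.0]
serial_result = serial_sum(values)       # evaluates to 1.0
blocked_result = blocked_sum(values, 2)  # evaluates to 0.0
```

Neither result equals the exact sum of 2, and the two orders disagree with each other even though no statement in either version is wrong. A relative debugger therefore needs a notion of tolerance rather than bitwise equality.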

Discrepancies due to reasons 1 and 2 are not specific to parallelization, but can arise from any compiler optimization.
Discrepancies due to errors in the tool depend on the maturity of the tool and are hoped to be rare. Discrepancies due to incorrect
user inputs can take several forms, as we discuss below. In order to be as specific as possible in these scenarios, we consider
user interactions with, and code generated by, the CAPTools system. Similar errors are possible with other parallelization
tools.

By its nature, the CAPTools system, like any parallelization support tool, has to be conservative in its assumptions about
data dependences. Often the existence of a data dependence will depend on certain input parameters. In such a case
parallelization will not be performed, in order to assure correctness of the code for all possible input values. The strength of the
CAPTools system lies in the fact that it allows the user to provide information about values of certain variables. This
knowledge leads to a more precise dependence analysis and thus to more efficient parallelized code. The drawback
is that this also opens the door to the introduction of errors. Consider the following code fragment as an example:

      program linearize
C     main routine
      double precision phi2(100*100), phi3(100,100)
      read idim
      do j = 1, 100
         do i = 1, 100
            phi2(i + j*100*idim) = phi2(i + j*100*idim) +
     &         0.5 * (phi3(i-1,j)+phi3(i+1,j)+phi3(i,j-1)+phi3(i,j+1))
         end do
      end do
      call output(phi2, nptsx, nptsy)
      end

If idim > 0, the j-loop can be parallelized. If idim = 0, the j-loop carries a dependence, since phi2(i) from iteration j
is used in iteration j+1. The value of idim is not known at compile time. By default, CAPTools will assume a dependence
and not parallelize the loop. However, it is possible for the user to inform CAPTools that idim > 0. If this is not true for
certain sets of input data, the produced results will be incorrect.
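
The dependence on idim can be checked by enumerating the phi2 indices touched in each iteration of the j-loop. The Python sketch below is an illustration only, with hypothetical function names; it models nothing but the index expression i + j*100*idim and reports whether two different j iterations ever touch the same element.

```python
def touched_indices(j, idim, n=100):
    """Indices of phi2 read and written during iteration j of the outer loop."""
    return {i + j * n * idim for i in range(1, n + 1)}


def j_loop_has_dependence(idim, n=100):
    """True if two different j iterations access the same element of phi2."""
    seen = set()
    for j in range(1, n + 1):
        current = touched_indices(j, idim, n)
        if seen & current:
            return True
        seen |= current
    return False


dependence_when_zero = j_loop_has_dependence(0)
dependence_when_one = j_loop_has_dependence(1)
```

With idim = 0 every iteration touches phi2(1) through phi2(100), so the check reports a loop-carried dependence. With idim = 1 the index sets are disjoint and the iterations are independent.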

There could also be situations where CAPTools does not have access to all of the source code. For example, suppose the
source code of subroutine sub is not provided, and the user incorrectly says that the statement

      call sub(a, n)

does not modify array a. This may result in a parallelized loop where a processor uses stale values for parts of a instead of 
the up-to-date ones residing on another processor. 

As a third scenario, consider the loop:

      do j = 1, n
         a(ind(j)) = a(ind(j)) * 0.5
      end do

If CAPTools has no further information about the array ind, it will assume a dependence and not parallelize the loop. The
user may have the knowledge that ind(j) = j, which would make the loop parallelizable. For these situations CAPTools
allows the user to explicitly manipulate the results of the CAPTools dependence analysis, which are stored in the form of a
dependence graph. Again, this creates the possibility of introducing errors. In Section 6 we give a detailed example of this scenario.

A programmer trying to isolate such bugs in the parallel program faces a daunting task. Not only has the serial source
code been sprinkled with calls to communication libraries, but the loop structure may have undergone transformations as
well. Since the parallelization tool attempted to optimize the communication patterns, the programmer must use
sophisticated reasoning to determine whether a processor is in fact using a current or stale value. Figure 1 contains a small example
of how serial code is transformed by CAPTools. The communication calls inserted there, such as cap_exchange, refer to
CAPTools-provided routines that are implemented in an appropriate way for the target machine (e.g., in MPI).

From the programmer's perspective, rather than attempting to debug the parallel program directly, a more promising 
approach is to determine where the parallel computation begins to differ from the serial one. This could be done by
instrumenting both codes with print statements and examining the outputs, or by running two debugging sessions side-by-side.
Both of these approaches have the drawback, however, that the programmer is required to deal with the tool-produced code 
in the parallel version. 

The goal of our work is to provide support for automatically finding bugs in programs parallelized with tools. We feel that 
such a goal is feasible because we have: 

• a reference program (serial code) for determining the expected behavior, and 

• mapping information from the parallelization tool that conveys how the serial program was transformed into the
parallel one.

This combination permits the debugger to do side-by-side executions of the serial and parallel versions of the code. In
particular, the user could compare corresponding states between the two executions without being required to look at the
parallel code. In the next section we will discuss possible approaches for automating the execution comparison process.

Original serial code:

      program main
      real*8 u(0:33,0:33), v(0:33,0:33)
      call loop(u, v)
      call dummy(u, v)
      end

      subroutine loop(upar, vpar)
      real*8 upar(0:33,0:33), vpar(0:33,0:33)
      integer i, j, d1, d2
      d1 = 33
      d2 = 33
      do i = 0, d1
         do j = 0, d2
            upar(i,j) = 0.
            vpar(i,j) = 1.
         end do
      end do
      do i = 1, 32
         do j = 1, 32
            upar(i,j) = upar(i,j) + 0.25 *
     &                  (vpar(i-1,j) +
     &                   vpar(i+1,j) +
     &                   vpar(i,j-1) +
     &                   vpar(i,j+1))
         end do
      end do
      return
      end

Output of CAPTools:

      PROGRAM PARALLELmain
      INTEGER CAP_LEFT, CAP_RIGHT
      PARAMETER (CAP_LEFT=-1, CAP_RIGHT=-2)
      REAL*8 u(0:33,0:33), v(0:33,0:33)
      INTEGER CAP_BLu, CAP_BHu
      COMMON /CAP_RANGE/CAP_BLu,CAP_BHu
      INTEGER CAP_ICOUNT
      CALL CAP_INIT
      call CAP_SETUPPART(0,33,CAP_BLu,CAP_BHu)
      call loop(u,v,CAP_BLu,CAP_BHu)
      call dummy(u,v)
      CALL CAP_FINISH()
      END

      subroutine loop(upar,vpar,
     &                CAP_Lupar,CAP_Hupar)
      integer CAP_LEFT, CAP_RIGHT
      PARAMETER (CAP_LEFT=-1, CAP_RIGHT=-2)
      REAL*8 upar(0:33,0:33), vpar(0:33,0:33)
      integer i,j,d1,d2
      integer CAP_Lupar, CAP_Hupar
      COMMON /CAP_RANGE/CAP_BLu,CAP_BHu
      integer CAP_BLu, CAP_BHu
      integer CAP_j
      d1 = 33
      d2 = 33
      do i=MAX(0,CAP_Lupar),MIN(d1,CAP_Hupar),1
         do j=0,d2,1
            upar(i,j)=0.
            vpar(i,j)=1.
         enddo
      enddo
      do CAP_j=1,32,1
         CALL CAP_EXCHANGE(vpar(CAP_Hupar+1,
     +                     CAP_j),
     +                     vpar(CAP_Lupar,CAP_j),
     +                     1,3,CAP_RIGHT)
      enddo
      do CAP_j=1,32,1
         CALL CAP_EXCHANGE(vpar(CAP_Lupar-1,
     +                     CAP_j),
     +                     vpar(CAP_Hupar,CAP_j),1,3,CAP_LEFT)
      enddo
      do i=MAX(1,CAP_Lupar),MIN(32,CAP_Hupar),1
         do j=1,32,1
            upar(i,j)=upar(i,j)+0.25*(vpar(i-1,j)+
     +         vpar(i+1,j)+vpar(i,j-1)+vpar(i,j+1))
         enddo
      enddo
      return
      end

FIGURE 1. How CAPTools transforms a serial loop.


2. Relative Debugging of Automatically Parallelized Programs 

There are many situations in software development where it is helpful to find out how two related programs differ in 
behavior. One example is that of locating a bug that was introduced between successive versions of a program. Relative 
debugging [1] is a technique that compares data during execution between a program that produces correct results and one
that produces faulty results, to narrow down at which point discrepancies occur. 

The technique of relative debugging is directly applicable to the situation of debugging automatically parallelized code
since we can assume the existence of a sequential version that produces the correct results. Let us assume we have a
sequential program P_s and a parallel program P_p that has been derived from P_s by running its source code through a parallelization
tool such as the CAPTools program described earlier. If P_p crashes or produces wrong results, we could isolate the bugs by
comparing data between P_s and P_p. In doing such a comparison, there are several issues to address.

What data values should be compared between the two executions? A good starting point is a user-specified value 
that has been determined to be incorrect by examining the results of a previous run. The testing could be made more precise 
by also comparing values used to define the known incorrect one. 

When during execution should they be compared? One possibility would be to perform the comparison immediately 
after any statement that could change a value of interest. This might be prohibitively expensive in execution time. Another 
approach is to do a comparison before and after every subroutine execution that could change the value. This effectively 
brackets the error location to one subroutine. A combination of both methods would first narrow down the discrepancy to a 
subroutine by coarse-grained comparisons, then re-execute and apply fine-grained comparison within the subroutine. 

How do we know if the values are different? Testing equality is something that will vary from application to
application. For example, in some programs, scalar values may be considered “the same” if they are within some tolerance. Arrays
might only be considered the same if all corresponding elements are equal. Alternatively, it may be acceptable to calculate
checksums of arrays and then compare the sums.
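
The three notions of equality mentioned above can be sketched as follows. This Python fragment is an illustration only: the function names and the default tolerance are our assumptions, and the same tolerance is used for both the relative and the absolute test purely for brevity.

```python
import math


def scalars_match(x, y, tol=1e-12):
    """Two scalars are 'the same' if they agree within a tolerance."""
    return math.isclose(x, y, rel_tol=tol, abs_tol=tol)


def arrays_match(a, b, tol=1e-12):
    """Arrays match only if all corresponding elements match."""
    return len(a) == len(b) and all(scalars_match(x, y, tol) for x, y in zip(a, b))


def checksums_match(a, b, tol=1e-12):
    """Cheaper test: compare the sums of the two arrays."""
    return scalars_match(math.fsum(a), math.fsum(b), tol)


x = 0.1 + 0.2          # differs from 0.3 in the last bit
y = 0.3
exactly_equal = (x == y)
equal_within_tolerance = scalars_match(x, y, tol=1e-9)
arrays_equal = arrays_match([x, 1.0], [y, 1.0], tol=1e-9)
arrays_differ = not arrays_match([x, 1.0], [y, 1.5], tol=1e-9)
sums_equal = checksums_match([x, 1.0], [y, 1.0], tol=1e-9)
```

The choice of tolerance is application specific; a value that is too tight flags harmless reordering effects, while one that is too loose can hide a genuine parallelization error.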

How do we get values from multiple address spaces to a place where they can be compared? There are at least three 
approaches for this. 

• One way to perform the comparison debugging would be to manually insert statements to print the array data to a file, 
recompile, rerun, and then inspect the printed data from the two executables. The drawbacks of this method are obvious,
particularly when many processes are involved.

• Another way would be to use an enhanced debugger that controls both executables. We then have the debugger insert 
breakpoints, compare the data at the breakpoints, and stop when differences are detected. This approach is taken by 
the GUARD project [1][2][3]. 

• A third technique is to have the two computations establish communication, transmit and compare their data, and stop 
when differences are detected. This can be achieved by instrumenting the source code with routines that send or 
receive data and perform the comparison. This approach was used as early as 1985 at NASA Ames to debug an FFT 
code that had been ported to a 4-CPU Cray 2 and showed subtle intermittent problems [4]. We have subsequently 
successfully employed this technique when porting codes to new machine architectures. 
How is distributed data handled, as in the case where a distributed array is being compared to a serial analog? If
an element-by-element comparison is requested, the distributed array needs to be reconstituted. Thus, array distribution
information is required. If checksums are being compared, each process in the parallel computation could calculate a partial
checksum. Those values could then be aggregated and compared to the serial checksum.
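
Both ways of handling distributed data can be sketched in a few lines. The Python fragment below is an illustration only: the pieces held by the individual processes are modeled as plain lists, the distribution information is a hypothetical list of 1-based inclusive (low, high) bounds, and all function names are ours.

```python
import math


def scatter(serial, bounds):
    """Split a serial array into the local pieces held by each process."""
    return [serial[low - 1:high] for low, high in bounds]


def reconstitute(pieces, bounds, n):
    """Reassemble the global array from local pieces using the distribution information."""
    full = [None] * n
    for piece, (low, high) in zip(pieces, bounds):
        full[low - 1:high] = piece
    return full


def aggregated_checksum(pieces):
    """Each process computes a partial checksum; the partial sums are then added."""
    return math.fsum(math.fsum(piece) for piece in pieces)


serial = [float(i) for i in range(1, 17)]
bounds = [(1, 4), (5, 8), (9, 12), (13, 16)]
pieces = scatter(serial, bounds)
rebuilt = reconstitute(pieces, bounds, len(serial))
parallel_checksum = aggregated_checksum(pieces)
serial_checksum = math.fsum(serial)
```

Reconstitution needs the bounds, whereas the aggregated checksum needs only the local pieces, which matches the observation that the checksum comparison can proceed without distribution information.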

2.1 The Role of the Parallelization Tool

If a tool is used in the parallelization process, some of the questions of the previous section can be answered without user 
intervention. Parallelization support tools such as CAPTools perform four major steps: 

• data dependence analysis across statements, iterations of loops, and subroutine calls,

• partitioning of array data, 

• masking calculations (such as distributing loops), and 

• generating necessary calls to communication library routines. 

If all the information generated in those steps is gathered in a database, the following kind of information is statically
available:

• definition-use chains for array elements across statements and subroutines, resulting from dependence analysis, and

• information about which part of an array belongs to a certain processor, resulting from data partitioning.

The first item of information can be used to identify those functions and subroutines that modify a certain array and should
therefore be instrumented for comparison. The second item can be used to determine how a distributed array maps to its serial
analog.

2.2 The Role of the Distributed Debugger 

In a relative debugging system for tool-parallelized codes, the distributed debugger provides the interface for the user. For 
example, the user could steer the comparison activities by selecting the arrays that should be compared. In addition, the 
debugger controls the executions being compared, retrieves information from the parallelization database, and instruments 
the target programs by having appropriate function calls inserted dynamically into the executables. Finally, the presence of 
the debugger will permit more extensive state examination and control of execution during the steps taken to isolate the
parallelization bugs.

3. Prototype Implementation 

As part of ongoing work in a debugger research project at NASA Ames, we have built a prototype relative debugging
system for tool-parallelized codes where we try to minimize the amount of user intervention required. In this section we
describe its implementation and in the process discuss how it answers the questions raised in the previous section. 

3.1 Determining Locations and Variables to be Compared 

Besides being used to produce the message passing program, CAPTools will provide vital information to the debugging
system. At the heart of CAPTools is a dependence analysis system that examines the serial code in order to establish the
safety of running loop bodies in parallel. After performing dependence analysis, it transforms the serial code to parallel
form, inserting calls to communication libraries as needed. The results of CAPTools's sophisticated analysis and
transformation phases are stored in a database in the file system. This fact makes it possible for a debugger to find out how a serial
array was distributed for parallel execution and which routines modify that array [15].

CAPTools stores the results from the data analysis and partitioning process in a database. From the dependence analysis 
phase, information about location of statements that assign to a particular array can be obtained. The CAPTools developers 
have provided us with routines that, given any statement in a program and an array name, construct a list of all routines that 
might define the array value that reaches the statement. In our prototype implementation, the user provides the name of a 
subroutine and a variable that has been identified as having an unexpected value. For example, if phi is a suspicious variable 
in subroutine output, then, by probing the database we might obtain the following information: 

      copy: phi5
      update: phi4
      setup_grid: phi6

This information tells us that variable phi5 in routine copy, variable phi4 in routine update, and variable phi6 in
routine setup_grid might define the suspicious variable phi. Therefore, we will perform a comparison of these variables
in the corresponding subroutines. Where the comparisons should be performed depends very much on the desired
granularity of the comparisons. For example, a comparison could be performed every time a distributed array is written to. We
restrict ourselves to inserting comparison routines on entry and exit of routines that modify the distributed arrays of interest.
The restriction is due to limitations in our prototype implementation and will be discussed in Section 7.

Having determined where and what to compare, we also need to address the situation where the suspicious variable is a
distributed array. To perform a comparison we need to know how distributed data from the parallelized program maps onto
its undistributed counterpart. Again, we can retrieve this information from the database with routines provided to us by the
CAPTools developers. If phi is an array declared in subroutine update, we obtain the following information when probing
the database:

      Information for symbol phi in routine update:
      symbol-name: phi
      declaration: (16,16)
      dimensionality: 2
      partition-index: 1
      partition-bounds: CAP_LOW_phi, CAP_HIGH_phi

This tells us that in subroutine update, array phi is a 2-dimensional array of size 16x16. It is partitioned in the first
dimension. Each process will calculate the partition phi(CAP_LOW_phi:CAP_HIGH_phi, 1:16) of array phi(1:16, 1:16). In
the case of 4 processors and blockwise distribution, CAP_LOW_phi will be 1, 5, 9, and 13 on processes 1, 2, 3, and 4, respectively,
and CAP_HIGH_phi will be 4, 8, 12, and 16. This information enables us to compare the distributed array from the parallelized
code with the corresponding sections of the undistributed array in the serial program.
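
The bounds quoted above follow from simple arithmetic. The Python sketch below is an illustration only: block_bounds is a hypothetical helper, and its treatment of a remainder (giving one extra element to the first processes) is our assumption and not necessarily the scheme CAPTools uses.

```python
def block_bounds(n, nprocs):
    """Return 1-based inclusive (low, high) bounds per process for a blockwise distribution."""
    base, extra = divmod(n, nprocs)
    bounds = []
    low = 1
    for p in range(nprocs):
        # Assumption: the first `extra` processes get one additional element.
        size = base + (1 if p < extra else 0)
        bounds.append((low, low + size - 1))
        low += size
    return bounds


phi_bounds = block_bounds(16, 4)
lows = [low for low, _ in phi_bounds]
highs = [high for _, high in phi_bounds]
```

For 16 elements on 4 processes this reproduces the lower bounds 1, 5, 9, 13 and upper bounds 4, 8, 12, 16 given in the text.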

3.2 Comparing Data Residing in Different Address Spaces 

As indicated in Section 2, we will perform relative debugging by instrumenting the programs with calls to subroutines
that establish communication and perform the comparisons. One of the efficiency concerns we had in our design was
avoiding unnecessary copies of data values, especially large arrays. For example, in the case where the serial version of an array
needs to be compared with its distributed analog on an element-by-element basis, we don’t want to transmit both arrays to a
comparison agent. Instead, we would prefer to transmit one array to the address space of the other and perform the
comparison there.

Our prototype system uses two routines running in the address spaces of the target processes in order to accomplish the 
comparison of data from different processes. 

• One routine resides in the parallelized code and sends to the serial code either the local contributions of a distributed 
array or the local checksum of a distributed array depending on which way of comparison is selected. 

• The receiving routine is in the reference executable. It receives the local contribution from each process of the parallel 
executable and compares it with the corresponding data of the undistributed array. When a mismatch is detected, a 
special function is called to indicate that fact. 

As mentioned above, we will perform the comparisons at entry and exit of suspicious subroutine calls. If sub1 and sub2
are both instrumented routines in the program, the following error scenarios are possible:

• Values correct on entry to sub1, wrong on exit from sub1: The routine sub1 has been identified as the culprit and has
to be further investigated. The error could also be in an un-instrumented routine called from sub1.

• Values correct on entry to sub1, wrong on entry to sub2: Routines sub1 and sub2 are on the same call stack. The error
occurred before the call to sub2, possibly in an un-instrumented routine. The discrepancy could potentially be due to
un-initialized data rather than an error.

• Values correct on exit from sub1, wrong on exit from sub2: Routines sub1 and sub2 are on the same call stack. The
error occurred after the return from sub1, possibly in an un-instrumented routine.

• Values correct on exit from sub1, wrong on entry to sub2: Routines sub1 and sub2 are not on the same call stack. The
error occurred between the two calls, possibly in an un-instrumented routine. Again, the discrepancy could be due to
un-initialized data rather than an error.

In our prototype implementation, we are taking advantage of the fact that CAPTools tries to preserve the original program
structure as much as possible. For example, no procedure inlining is being performed, and there is a one-to-one
correspondence of the subroutine calls that occur in the serial and parallel programs. CAPTools does, however, perform procedure cloning,
i.e., in some cases, parallel as well as serial versions of the same procedure exist. We are taking this fact into account by
instrumenting the cloned subroutines as well.

Another issue concerning comparing of data is determining the particular comparison test to use. Our prototype
implementation provides the following alternatives for checking of distributed array data:

• The local checksums of all parts of the distributed array are added and compared to the checksum of the undistributed
array. An error is reported if the difference between the checksums exceeds a user-supplied threshold. No array distribution
information is required to perform this comparison.

• The local checksum of each part of the distributed array is compared to the corresponding partial checksum of the
undistributed array. An error is reported if one of the differences exceeds a user-supplied threshold. This method
requires array distribution information so that corresponding array sections can be identified.

• An element-by-element comparison of undistributed and distributed arrays is performed. This comparison, just as the
previous method, requires array distribution information.
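
The three alternatives differ in how precisely they localize an error. The Python sketch below is an illustration only and is not the prototype's code: function names, the list-based model of the local pieces, and the example data are our assumptions. The example swaps two elements across a block boundary, an error that leaves the global checksum unchanged but is visible to the other two tests.

```python
import math


def global_checksum_differs(serial, pieces, threshold):
    """Alternative 1: add all local checksums and compare with the serial checksum."""
    total = math.fsum(math.fsum(piece) for piece in pieces)
    return abs(total - math.fsum(serial)) > threshold


def blocks_with_checksum_error(serial, pieces, bounds, threshold):
    """Alternative 2: compare each local checksum with the matching serial section."""
    bad = []
    for rank, (piece, (low, high)) in enumerate(zip(pieces, bounds)):
        if abs(math.fsum(piece) - math.fsum(serial[low - 1:high])) > threshold:
            bad.append(rank)
    return bad


def mismatched_elements(serial, pieces, bounds, threshold):
    """Alternative 3: element-by-element comparison; returns 1-based global indices."""
    bad = []
    for piece, (low, high) in zip(pieces, bounds):
        for offset, value in enumerate(piece):
            if abs(value - serial[low - 1 + offset]) > threshold:
                bad.append(low + offset)
    return bad


serial = [float(i) for i in range(1, 9)]
bounds = [(1, 4), (5, 8)]
# Elements 4 and 5 have been swapped across the block boundary.
pieces = [[1.0, 2.0, 3.0, 5.0], [4.0, 6.0, 7.0, 8.0]]

global_flag = global_checksum_differs(serial, pieces, 1e-9)
bad_blocks = blocks_with_checksum_error(serial, pieces, bounds, 1e-9)
bad_elements = mismatched_elements(serial, pieces, bounds, 1e-9)
```

The cheaper tests transmit less data but can miss compensating errors, so a reasonable strategy is to start with checksums and fall back to the element-by-element test once a discrepancy has been bracketed.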

3.3 Program Instrumentation 

The question arises how the calls to comparison routines get inserted into the executables. Having the user modify the
source is clearly not an option for an automated debugging system. In our prototype we use dynamic instrumentation
based on the DyninstAPI, a dynamic code adaptation toolset from the University of Maryland [8]. We chose to use dynamic
instrumentation for two main reasons:

• We wanted to avoid the context switches that would result from fine-grained instrumentation being executed in the 
debugger. Such context switches could slow down execution by several orders of magnitude [16]. 

• We also wanted to minimize what was required of the user. By using dynamic instrumentation we can avoid changes
to the compilation process.

The DyninstAPI comes as a set of library routines which provide a portable way of inserting new code into a running program.
The new code segments can be used to instrument the program in such a way that execution time does not suffer unduly.
Besides inserting code segments, the DyninstAPI allows operations such as:

• attaching to and detaching from a running process, 

• inserting or removing subroutine calls from the application program, 

• stopping, continuing, and terminating an application program, and 

• reading from and writing to areas of memory of the application program. 

For example, by using the DyninstAPI it is straightforward to patch a running program so that function execution counts
are collected. While such a thing is also possible in a conventional debugger, its interpretation of each piece of
instrumentation would require several context switches. This can slow down the execution time of some programs by several orders of
magnitude. When done using the DyninstAPI, the function counts will be collected in the address space of the program
itself, and the effect on execution time is minimized.

Another key feature of the DyninstAPI is that the interface is analogous to a machine-independent intermediate
representation of the instrumentation as an abstract syntax tree. This allows the same instrumentation code to be used on different
platforms.

In our relative debugging system, in order to insert calls to the above routines at appropriate points, we use a process that
we call the Instrumentation Server (IS). It uses the DyninstAPI to control and modify the executables. In a gdb-like manner,
this program accepts commands from standard input. The most important commands are:

• attach: attach to a process, 

• createPoint: create an instrumentation point in a process, and 

• insertCall: insert a function call at an instrumentation point. 
The commands take arguments such as process ID, names of executables, routine names, and specifications of the arguments
to be passed to the instrumentation functions. An example is provided in Figure 6. 

3.4 User Interface and System Coordination 

To coordinate the actions of the various software components involved and to provide an interface to the user, we use
p2d2, a portable distributed debugger developed at NASA Ames [19]. One of the goals of the p2d2 project is to build a
debugger for distributed programs that is portable across a variety of target machines and whose user interface scales
to at least 256 processes. The result of our work so far [10] is a debugger that runs on a variety of Unix-based
machines and can be used on both MPI and PVM applications. To achieve the portability goal, p2d2 abstracted serial
debugging objects and operations in a service layer. In the current implementation, this “debugger server” is in turn layered
on gdb, the debugger from the Free Software Foundation [9].

In recent work we extended p2d2 so that it could provide a global view of distributed data [11]. P2d2 collects the local
data contributions from each processor and assembles a global picture. For this process, information about how the array
is distributed across the processors is necessary. P2d2 can obtain this information either from a database, such as the
one produced by CAPTools, or by having the user provide it via a dialog box.
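
The assembly of a global picture from local contributions can be sketched as follows. This is an illustration only: it assumes a one-dimensional block distribution in which any remainder rows go to the lowest-numbered processes, and the helper names block_bounds and assemble_global are hypothetical. In the real system the distribution would be taken from the CAPTools database or from the user.

```python
def block_bounds(n, nprocs, rank):
    """Half-open index range [low, high) owned by rank.

    Assumes a block distribution in which the first n % nprocs ranks
    receive one extra element; this rule is an assumption of the sketch.
    """
    base, extra = divmod(n, nprocs)
    low = rank * base + min(rank, extra)
    high = low + base + (1 if rank < extra else 0)
    return low, high


def assemble_global(local_pieces, n):
    """Place each process's local rows at their global positions."""
    nprocs = len(local_pieces)
    global_rows = [None] * n
    for rank, piece in enumerate(local_pieces):
        low, high = block_bounds(n, nprocs, rank)
        if len(piece) != high - low:
            raise ValueError(f"rank {rank} sent {len(piece)} rows, "
                             f"expected {high - low}")
        global_rows[low:high] = piece
    return global_rows
```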

Getting the components described in Sections 3.1, 3.2, and 3.3 to cooperate to solve the relative debugging problem is
the job of p2d2. It retrieves the necessary information about critical routines and array distribution as described
above and passes it to the executables via IS. Just as in the case of the global array viewer, this information is
obtained by probing the CAPTools database mentioned earlier. Having retrieved information about where to instrument,
p2d2 then interacts with IS to insert the initialization call that provides the distribution information as well as the
calls that move the data between processes and perform the comparison.

4. Behind the Scenes of a Relative Debugging Session 

To illustrate what activities need to be coordinated, consider the following scenario from the user’s perspective. After
having used CAPTools to parallelize a program S, the user runs the resulting code P and finds that it does not compute
the same answer. To prepare for a p2d2 debugging session, the user has to link the application with a special version of
MPI_Init which provides the following functionality:

• The process ID of each MPI process is written to a file. 

• The program is put into an infinite sleep loop. 

The added functionality allows us to attach to the processes of the parallelized version of the code before they progress in 
their execution. 
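
The startup behavior of the special initialization routine can be sketched as below. This is only an analogy in Python, not the actual MPI_Init wrapper: the function name, the dictionary used as a release flag, and the polling interval are assumptions. In the real system the flag is a program variable that the attaching tool sets (see the first command in each column of Figure 6).

```python
import os
import time


def record_pid_and_wait(contact_file, go_flag, poll_seconds=0.1,
                        sleep=time.sleep):
    """Write this process's ID, then idle until go_flag["go"] is true.

    go_flag stands in for the variable that an attached tool would set
    from outside; the polling loop plays the role of the sleep loop.
    Returns the number of polls made, which is useful only for testing.
    """
    contact_file.write(f"{os.getpid()}\n")
    contact_file.flush()
    polls = 0
    while not go_flag["go"]:
        sleep(poll_seconds)
        polls += 1
    return polls
```

Passing the sleep function as a parameter is a design choice of the sketch: it lets the loop be exercised without real delays.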

Now the user starts p2d2 with the command line:

    p2d2 -R "mpirun -np 4 P" S

which requests that p2d2 compare the execution of “mpirun -np 4 P” with the execution of S. After p2d2 starts up, the
following sequence of events occurs, as depicted in Figure 2.

FIGURE 2. Coordination of the comparison activities. (Diagram not reproduced; its legend distinguishes read/write and
attach interactions.)

1. P2d2 starts a gdb to control execution of S. Then p2d2 requests that gdb insert a breakpoint at the beginning of
   process S and at the entry point of function “p2d2DiffDetected”. This is the function, discussed in Section 3.2,
   that gets called when a difference is detected.

2. The user then selects an array in p2d2's source display and invokes the Run operation. P2d2 issues a run request to
   the gdb controlling S.

3. P2d2 issues the shell command “mpirun -np 4 P”.

4. The four processes resulting from that command record contact information, including their process IDs, in the file
   system.

5. After it sees that the contact file has been created, p2d2 starts up IS.

6. IS attaches to the process running S.

7. P2d2 reads the parallel execution contact information from the file system. It then sends the attach requests to IS.

8. IS attaches to the four processes running P.

9. P2d2 consults the CAPTools database and retrieves information about distributed arrays and which functions will need
   to be instrumented. The database also provides the local name, within each function, of the array that needs to be
   monitored.

Then p2d2 and IS complete the instrumentation of the processes and proceed with the execution.

• IS inserts instrumentation into the process running S and the four processes running P.

• IS detaches from the P processes.

• IS notifies p2d2 that the instrumentation of S is complete. P2d2 sends a “continue execution” request to the gdb
controlling S.

      program jacobi
c     main routine
      double precision phi2(0:101,0:101)
      double precision oldphi2(0:101,0:101)

      call setup_grid(phi2, ...)
      do iter = 1, 100
         call copyphi(oldphi2, phi2)
         call update(phi2, oldphi2, ...)
      end do
      call output(phi2, nptsx, nptsy)
      return
      end

      subroutine output(phi3, ...)
c     Routine that prints the result
      do j = 0, nptsx+1
         do i = 0, nptsy+1
            phi7(i,j) = phi3(i,j)
         end do
      end do
      do j = 0, nptsx+1
         write (8,*) (phi7(i,j), i = 0,
     #      nptsy+1)
      end do
      return
      end

      subroutine copyphi(oldphi5, phi5)
c     Routine that saves old values
      do j = 0, 43
         do i = 0, 43
            oldphi5(i,j) = phi5(i,j)
         end do
      end do
      return
      end

      subroutine update(phi4, oldphi4, ...)
c     Routine that updates array
      do j = 0, nptsx+1
         do i = 0, nptsy+1
            oldphi8(i,j) = oldphi4(i,j)
         end do
      end do
      do j = 1, nptsx
         do i = 1, nptsy
            phi4(i,j) = 0.25*(oldphi8(i-1,j)
     #         + oldphi8(i+1,j)
     #         + oldphi8(i,j-1)
     #         + oldphi8(i,j+1))
         end do
      end do
      return
      end

      subroutine setup_grid(phi6, ...)
c     Routine to set up the initial
c     grid values
      do j = 0, nptsx+1
         do i = 0, nptsy+1
            phi6(i,j) = 0.0
         end do
      end do
      do j = 1, nptsx
         do i = 1, nptsy
            phi6(i,j) = 1.0
         end do
      end do
      return
      end

FIGURE 3. Jacobi program source outline.


• When the inserted instrumentation in S and P is executed, it establishes communication links between the serial and
parallel processes. In our current prototype implementation, we use named pipes for communication.

• At the function entry and exit points that were instrumented, the parallel processes send their state information to
the serial process. It compares the parallel data to its own. If there is a difference, it calls p2d2DiffDetected, which
causes a trap because of the breakpoint that was set there.

• When p2d2 is notified of the trap, it determines that the cause is a difference in state between the serial and
parallel executions. It then presents the information to the user.
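
The comparison step performed by the serial process can be sketched as follows. The sketch assumes the parallel data has already been received and assembled into the same layout as the serial array; the names first_difference and check_point are hypothetical, and the callback diff_detected stands in for the p2d2DiffDetected routine on which the debugger has set a breakpoint.

```python
def first_difference(serial, parallel, bound=0.0):
    """Return the first index where the values differ by more than bound.

    Returns None if no such index exists. NaN values never exceed the
    bound in this test, so a real routine would need to treat them
    explicitly.
    """
    if len(serial) != len(parallel):
        raise ValueError("arrays must have the same number of elements")
    for index, (s, p) in enumerate(zip(serial, parallel)):
        if abs(s - p) > bound:
            return index
    return None


def check_point(variable, routine, where, serial, parallel,
                diff_detected, bound=0.0):
    """Compare one variable at a routine entry or exit.

    Calls diff_detected (a stand-in for the routine the debugger has a
    breakpoint on) when a difference is found. Returns True if equal.
    """
    index = first_difference(serial, parallel, bound)
    if index is not None:
        diff_detected(variable, routine, where, index)
        return False
    return True
```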


5. An Example Debugging Session 

Suppose a user is parallelizing a code with CAPTools. The code is a very simple Fortran implementation of a Jacobi iteration 
algorithm. An outline of it is shown in Figure 3. 

During the parallelization process, the user examines the dependence graph with an eye for removing dependence edges 
that might result in unnecessary communication or might prevent parallelization altogether. While this gives the user an 
opportunity to improve code performance, it can also result in incorrectly behaving code. 


[Figure 4: CAPTools dependence display; screenshot not reproduced. Legible fields: type EXACT TRUE; caused by array
oldphi8; source line 9: oldphi8(i,j) = oldphi4(i,j); sink line 14: the assignment to phi4(i,j); dependence status
Definite.]


Figure 4 shows the CAPTools display of the dependence edges between the two loops in routine update. After inspecting
each edge, the user decides to remove the one resulting from the definition of element oldphi8(i,j) in loop 1 and the
reference of oldphi8(i+1,j) in loop 2.

The user then runs the resulting code (named “par_test”) and notices that the values of array phi7 printed in routine
output are different from those printed by a run of the sequential version. He then invokes the p2d2 debugger in relative
debugging mode with the command:

    p2d2 -R "mpirun -np 4 par_test" serial_test

where “serial_test” is the name of the serial executable.

When the p2d2 display appears, the user brings up a dialog box and asks for the value of variable phi7 in routine output
to be monitored during execution (see Figure 5).

After the user has requested the start of execution, the CAPTools database is probed behind the scenes. The probe
determines that the following arrays should be checked at entry and exit of the corresponding routines:

• oldphi5 in copyphi

• phi4 in update

• phi6 in setup_grid

FIGURE 5. Preparing for a relative debugging run in p2d2. (Screenshot not reproduced: the p2d2 source display shows
routine output, and a dialog adds the entry output:phi7 to the check list.)

After starting both executions, the routines are instrumented and execution is continued. The comparison subsequently
detects a difference in phi4 on the exit of update. The debugger then displays this message:

    A difference was detected in variable 'phi4' when exiting from function 'update'.
    The variable had tested equal when entering function 'update'.


which brackets the error to execution of routine update. When the user inspects the parallel version of that routine: 


      subroutine update(phi4, oldphi4, ...)
      do j=0, nptsx+1, 1
         do i=MAX(0, CAP_BLphi1),
     #         MIN(nptsy+1, CAP_BHphi1), 1
            oldphi8(i,j)=oldphi4(i,j)
         enddo
      enddo
      CALL CAP_BEXCHANGE(
     #     oldphi8(CAP_Loldphi4-1, 1),
     #     oldphi8(CAP_Holdphi4, 1),
     #     ..., CAP_LEFT)
      do j=1, nptsx, 1
         do i=MAX(1, CAP_Loldphi4),
     #         MIN(nptsy, CAP_Holdphi4), 1
            phi4(i,j)=0.25*(oldphi8(i-1,j) + oldphi8(i+1,j)
     #         + oldphi8(i,j-1) + oldphi8(i,j+1))
         enddo
      enddo
      return
      END

he sees that there is an exchange of the values of oldphi8 on the left side to obtain the required values of
oldphi8(i-1,j), but there is no exchange on the right side. Therefore, stale values of oldphi8(i+1,j) are used to
update phi4. The missing communication call is due to the erroneous removal of the dependence edge earlier.
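
The effect of a missing exchange can be reproduced in a small simulation. The sketch below is a one-dimensional simplification of the Jacobi update, not the CAPTools-generated code: each simulated process keeps a full-size copy of the array but updates only the block it owns, and the boundary values of its neighbors are refreshed by explicit copies that play the role of the exchange calls. All function and parameter names are hypothetical.

```python
def serial_jacobi(u, iters):
    """Reference 1-D Jacobi iteration with fixed boundary values."""
    u = list(u)
    for _ in range(iters):
        old = list(u)
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (old[i - 1] + old[i + 1])
    return u


def parallel_jacobi(u, nprocs, iters, exchange_right=True):
    """Simulate a block-distributed run; optionally omit one exchange."""
    n = len(u)
    bounds = [(r * n // nprocs, (r + 1) * n // nprocs)
              for r in range(nprocs)]
    local = [list(u) for _ in range(nprocs)]
    for _ in range(iters):
        old = [list(a) for a in local]
        # Exchange phase: refresh the cells just outside each owned block.
        for r, (lo, hi) in enumerate(bounds):
            if r > 0:
                old[r][lo - 1] = local[r - 1][lo - 1]
            if exchange_right and r < nprocs - 1:
                old[r][hi] = local[r + 1][hi]
        # Update phase: each process updates only the cells it owns.
        for r, (lo, hi) in enumerate(bounds):
            for i in range(max(1, lo), min(n - 1, hi)):
                local[r][i] = 0.5 * (old[r][i - 1] + old[r][i + 1])
    result = []
    for r, (lo, hi) in enumerate(bounds):
        result.extend(local[r][lo:hi])
    return result
```

With both exchanges the simulated parallel run reproduces the serial values exactly, because every owned element is computed from the same operands in the same order. With exchange_right=False each process keeps using the initial value of its right neighbor's boundary cell, and the results drift apart after a few iterations.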

6. Prototype Performance Evaluation 

The purpose of this section is to evaluate the feasibility of our implementation approach. In our prototype
implementation we have instrumented the code dynamically using our instrumentation server IS (and the DyninstAPI). An
alternative way to implement our relative debugging system is to run the processes under a debugger such as gdb and
invoke the comparison routines at breakpoints on entry and exit of subroutine calls. In both cases the comparison
routines must be linked with the executable. Using IS, the call is actually patched into the code. The application is
allowed to run without intervention, and once the target process is continued the call to the comparison routines is
executed on every entry to and exit from suspicious routines. By contrast, in the gdb approach the target processes trap
twice on every instrumented call, at which point gdb orchestrates the call to the instrumentation routine and the
continuation of execution after that call returns. This potentially causes many context switches. To quantify how this
affects the overall runtime of the relative debugging process, we conducted a number of timing experiments, which are
discussed in this section.

6.1 Description of the test environment 

In order to study the effect of both methods on execution time, we set up a simple test environment which does not
include the p2d2 debugger. As a sample application, we used an MPI code and compared data between a single and a
multiple process MPI run. The single process MPI run is considered the reference version producing the correct results.
Communication between the two executables is implemented by using named pipes and is performed by the comparison
routines discussed earlier. The single and multiple process executions synchronize after each comparison. In our
timings, we measured the elapsed execution time of the reference program. We built the single and multiple process
executables using the special version of MPI_Init which was described in Section 4. The added functionality allows us to
attach to the processes with IS or with gdb, respectively, before they progress in their execution.

In our timing tests we want to insert the function call compit(arg1) at the entry and exit points of function
sub1(arg1). The source code is contained in file test.f and the name of the executable is a.out. Each approach is
implemented using a simple shell script. For both methods the shell script starts the executables and reads the file
containing the process IDs. When using dynamic instrumentation we proceed as follows: for each process, we attach with
IS and issue the sequence of commands given in Figure 6.

The first command breaks the process out of the infinite loop. The subsequent commands create instrumentation points at
the entry and exit of subroutine sub1 and insert calls to the comparison routine compit. Then IS continues the execution
of the executable and detaches.

In the case of using the debugger, we attach to each of the processes with gdb. Each gdb is started with a command file
as shown in Figure 6. The command file breaks the processes out of the infinite loop, sets breakpoints at the beginning
and end of subroutine sub1, and ensures that a call to the comparison routines is executed each time a breakpoint is
hit. Note that gdb does not allow us to set breakpoints at exits of subroutine calls. We therefore need to set the
breakpoint at the appropriate line number.

IS commands:

    set pid __GO 1
    createPoint pid sub1_ ENTRY
    insertCall pid compit_ 0
    createPoint pid sub1_ EXIT
    insertCall pid compit_ 0
    continue pid
    detach pid

gdb commands:

    set _GO 1
    break test.f:50      ! entry to sub1
    commands
    call compit2(arg1)
    continue
    end
    break test.f:63      ! exit from sub1
    commands
    call compit2(arg1)
    continue
    end
    continue

FIGURE 6. Command sequence for IS and corresponding commands for gdb.
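
The per-process scripting step can be sketched as follows. The command lines mirror the IS column of Figure 6. Substituting a numeric process ID for the word pid, the default routine and variable names, and the helper names read_pids and is_script are assumptions made for this sketch; the text says only that a simple shell script performs this step.

```python
def read_pids(contact_text):
    """Extract process IDs from the contents of the contact file.

    Assumes one whitespace-separated integer per process.
    """
    return [int(token) for token in contact_text.split()]


def is_script(pid, routine="sub1_", hook="compit_", go_variable="__GO"):
    """Build the IS command sequence for one process.

    The defaults follow Figure 6 as printed; their exact spelling in the
    real system is not confirmed by the text.
    """
    lines = [
        f"set {pid} {go_variable} 1",          # release the sleep loop
        f"createPoint {pid} {routine} ENTRY",
        f"insertCall {pid} {hook} 0",
        f"createPoint {pid} {routine} EXIT",
        f"insertCall {pid} {hook} 0",
        f"continue {pid}",
        f"detach {pid}",
    ]
    return "\n".join(lines) + "\n"
```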

The computationally intensive part of the MPI application is a two-dimensional matrix multiplication. As a method for
comparison we chose an element-by-element comparison, issuing a warning message when the difference between two
elements exceeds a pre-determined bound. In our timing experiments we vary the following parameters:

• NP: The number of processes in the multiple process execution.

• NLEN: The size of the array dimensions in the single process execution. The local dimensions in the multi-process
execution are NLEN/sqrt(NP). Each process of the multiple process run sends its local array to the single process
execution. Therefore, NLEN*NLEN double precision array elements are transferred at each comparison.

• NCOUNT: The number of times that the suspicious subroutine is called. This means that 2*NCOUNT comparisons are
performed.

• NCOMPUTE: The length of the loops in the matrix-multiplication loop. The number of calculations between each
comparison is therefore of order NCOMPUTE*NCOMPUTE*NCOMPUTE.

By varying these four parameters we can investigate the effect of both relative debugging methods on execution time
with respect to parallel scalability, the size of the data being compared, and the frequency and granularity of the
required comparisons. Our timings were performed on an SGI Origin2000 with a 400 MHz clock rate. We used NP+1 CPUs for
our timing experiments.
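
As a worked example of how these parameters translate into data volume, the sketch below evaluates the formulas given in the list above. The function name and the assumption of 8 bytes per double precision element are ours. The sample values NLEN=1024, NCOUNT=10, NCOMPUTE=1 are those of test case 3, and NP=16 is one of the process counts on its chart.

```python
import math


def comparison_cost(np_procs, nlen, ncount, ncompute, bytes_per_element=8):
    """Evaluate the workload formulas for one parameter setting.

    bytes_per_element=8 assumes IEEE double precision; the text states
    only that the elements are double precision.
    """
    elements = nlen * nlen
    comparisons = 2 * ncount            # one at entry, one at exit
    return {
        "local_dim": nlen / math.sqrt(np_procs),
        "elements_per_comparison": elements,
        "bytes_per_comparison": elements * bytes_per_element,
        "comparisons": comparisons,
        "total_bytes": elements * bytes_per_element * comparisons,
        "work_between_comparisons": ncompute ** 3,
    }


case3 = comparison_cost(np_procs=16, nlen=1024, ncount=10, ncompute=1)
```

Under these assumptions each comparison in case 3 moves 1024*1024*8 = 8,388,608 bytes (8 MiB), and the 20 comparisons of a run move 160 MiB in total, while only on the order of one calculation separates consecutive comparisons.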
6.2 Discussion of the timing results

In our tests, the gdb-based approach had a more severe impact on the execution time than dynamic instrumentation, even
when the number of comparisons performed is low and the granularity of the instrumentation is medium. Timings for cases
in which a small amount of data is compared are given in Figure 7, Figure 8, and Figure 10. In these cases, the
execution time of the relative debugging run increases with the number of processes. The increase is much higher with
gdb than with IS.

If a large amount of data has to be compared, the difference between using gdb and using IS is less apparent. Running
these cases with the SGI performance analysis tools showed that the execution time is dominated by the time spent
performing the actual comparison of the data. Timings for cases with large amounts of data are shown in Figure 9 and
Figure 11.

In our test setup, we also compared the runtime of dynamically instrumented executables to that of executables whose 
source code had been instrumented and re-compiled. Except in test case 1, where the execution time is extremely short, we 
found that there was basically no difference in the runtime. The runtime of test case 1 increased by a factor of 9 in the worst 
case when dynamic instrumentation was used. 

The timings indicate that dynamic instrumentation, when it comes to relative debugging, imposes little overhead on exe- 
cution time when small amounts of data are compared many times. That is the strength of dynamic instrumentation. This 
corresponds to a fine-grained instrumentation on statement level. Unfortunately, at this point the implementation of Dyninst 
for the SGI does not allow for arbitrary instrumentation, but only at entry and exit of subroutine calls. 

From our timings we can also draw conclusions about how relative debugging affects the runtime of the execution. In all 
of our test cases we compared the run time of the un-instrumented sequential executable to execution time of a full relative 
debugging session. Of course, these timings strongly depend on which kind of comparison test is used, how often an error is 
encountered, and how much information is printed by the comparison routines. The element-by-element comparison that we 
chose in our tests is relatively expensive. In test cases 1, 2, and 3, the runtime of the un-instrumented program takes only a 
fraction of a second and increases to up to a minute during a relative debugging session. These cases correspond to fine- 
grained checking with little computation being performed between the comparisons. For test case 4, the runtime increases 
from 4 seconds to about 80 seconds in the worst case. In test case 5, which corresponds to a coarse-grained instrumentation, 
the runtime increases from about 200 seconds to 240 seconds in the worst case. The timings show that a full relative debug- 
ging session could take a long time depending on the runtime of the un-instrumented code, the size of the data that needs to 
be checked, and the granularity of the checks. The advantage of the method is that once the user has set up the relative 
debugging run by indicating an initial suspect variable, the debugging session can run in batch mode, without any further 
user interaction. The p2d2 debugger currently does not support batch processing, and therefore our prototype does not sup- 
port it either. The timings, however, suggest strongly that such an interface is necessary. 

Even when running in batch mode, care has to be taken when fine-grained comparison is being used. A completely instru- 
mented program making fine-grained comparisons could take a very long time to run. It would probably be faster to make 
multiple runs of the code with the instrumentation points changing to zero in on the first difference. In the future work sec- 
tion below we discuss such a possibility. 
7. Implementation experiences

While we see great promise in the progress to date toward our goal of automatic support for debugging tool-parallelized
programs, we have also observed limitations. Many of these restrictions are imposed by the foundation software that we
have used to build our implementation. For example, the currently released version of CAPTools does not provide
information in its database about arrays distributed across more than one dimension, since this information is not
relevant for the typical use of CAPTools. However, this is not an inherent restriction in CAPTools, and the developers
can provide us with a special version of the database that will allow us to retrieve this kind of information.

In the case of Dyninst, the implementations for the platforms we tested are restricted in several ways. Perhaps the most
significant is that instrumentation can currently only be placed at subroutine entry and exit. It is our understanding
that, eventually, the package will permit instrumentation at arbitrary instructions in the code, effectively removing
this restriction. In addition to this limitation, code patched in by the version of the DyninstAPI that we are using is
unable to access function parameters in the ninth position or thereafter. This problem is particularly felt in Fortran
codes, where long parameter lists are common. Dyninst also has a limited knowledge of the symbol table. In particular,
it knows the location of global variables, but not locals or parameters.

We can get around some of the Dyninst symbol table limitations by using gdb to get that information. Unfortunately,
operating system issues come up when both our Dyninst-based instrumentation server and gdb want to attach to the same
process. On some systems, such as Linux, only one can be attached at a time. In that case, our implementation will need
to coordinate attach and detach requests. Our experience on Linux shows, however, that a process cannot successfully be
attached by Dyninst after it has been detached. For the purposes of the prototype, we restricted ourselves to an IRIX
implementation where both gdb and our Dyninst-based instrumentation server could be attached at the same time.

Other issues also arise as a result of trying to debug Dyninst-instrumented codes. For example, when execution stops in
a routine called from an instrumentation point, the runtime stack is in a state that gdb cannot handle: there is a
return address on the stack that is outside the range that gdb is looking for.

In addition to limitations in the software packages used by the prototype, we should also point out some within the proto- 
type itself. In particular, the current implementation requires the target executables to include the instrumentation routines 
described in Section 3.2. We plan to use dynamic linking in the future to address this restriction. 

One additional limitation of our prototype is that, while comparing executions we currently require that the checkpoints 
to be compared occur in the same order in the two runs. In the future we can address this restriction in a manner similar to 
Guard [2] by saving out-of-sequence checkpoints in the file system until the comparison can be made. 
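
The idea of holding out-of-sequence checkpoints until their counterpart arrives can be sketched as follows. This is a hypothetical illustration, not the Guard mechanism or our implementation: it buffers checkpoints in memory rather than in the file system, and it assumes that each checkpoint carries a key (for example routine name, entry or exit, and call count) that identifies its partner in the other run.

```python
class CheckpointMatcher:
    """Pair up checkpoints from two runs that may arrive in any order."""

    def __init__(self, compare):
        # compare(key, serial_data, parallel_data) performs the check.
        self.compare = compare
        self.pending = {"serial": {}, "parallel": {}}

    def arrive(self, side, key, data):
        """Record a checkpoint from side ("serial" or "parallel").

        Returns the comparison result if the partner checkpoint was
        already waiting, otherwise stores the data and returns None.
        """
        other = "parallel" if side == "serial" else "serial"
        if key in self.pending[other]:
            other_data = self.pending[other].pop(key)
            if side == "serial":
                return self.compare(key, data, other_data)
            return self.compare(key, other_data, data)
        self.pending[side][key] = data
        return None
```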

8. Related Work 

Guard [1][2][3] is a relative debugger for parallel programs developed at Griffith University in Brisbane, Australia. In
contrast to our approach, where two executables communicate data directly with each other and do the comparison, in
Guard the debugger collects the data from the executables and does the comparison. Also, Guard does not aim particularly
at automatically parallelized programs. Information about where to do the comparisons and which parts of the data to
compare is provided by the user via the command language. To compare array data from parallel programs, the user must
describe the decomposition manually using a distributed array syntax.

In other work on debugging automatically parallelized programs, Cohn [6] has investigated having the debugger provide
a sequential view of an executing parallel program. While he does not use relative debugging techniques, his analysis of
consistency issues between sequential and parallel executions could be useful in identifying candidate instrumentation
points for making comparisons in the relative debugging approach.

The idea of using information from parallelization tools to aid in debugging has also been around for some time. For
example, Hood, Kennedy, and Mellor-Crummey [13] used dependence information from a parallelizing compiler to determine
which data accesses to instrument to find races in a shared-memory program execution.

9. Project Status and Future Work 

We have built a prototype of a relative debugging system for comparing serial codes and their tool-produced parallel 
counterparts where array comparisons are done either by computing checksums or by doing element-by-element equality 
tests. After the user specifies a variable and scope to be checked, the debugger uses the CAPTools database to determine 
which variables should be monitored and in which functions. We used the dynamic instrumentation tool Dyninst in order to 
minimize the overhead involved in making the comparisons. We ran extensive timings and tested the need in such an envi- 
ronment for dynamically inserted procedure calls versus interpreted calls. 
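
The two comparison styles mentioned above can be contrasted in a short sketch. The choice of CRC-32 over the packed double precision values is an assumption made for illustration; the text does not say which checksum the prototype computes. The function names are hypothetical.

```python
import struct
import zlib


def checksum(values):
    """CRC-32 over the little-endian IEEE double encoding of values."""
    packed = struct.pack(f"<{len(values)}d", *values)
    return zlib.crc32(packed)


def differing_indices(serial, parallel):
    """Element-by-element equality test; returns all differing indices."""
    return [i for i, (s, p) in enumerate(zip(serial, parallel)) if s != p]


def compare(serial, parallel):
    """Try the cheap checksum first; locate elements only if it fails.

    Equal checksums are taken as "no difference", which is not a proof
    of equality because distinct data can share a checksum.
    """
    if len(serial) != len(parallel):
        raise ValueError("arrays must have the same number of elements")
    if checksum(serial) == checksum(parallel):
        return []
    return differing_indices(serial, parallel)
```

A checksum needs only a few bytes to be exchanged per array but reports merely that a difference exists. The element-by-element test requires the full data but identifies where the arrays differ.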

In the near future we will integrate the relative debugging features more seamlessly into p2d2. In particular, we would
like debugging requests that the user makes on the serial code to be performed on the parallel version as well. In order
to do that, we need to modify the p2d2 user interface to support multiple computations executing simultaneously. In
addition, we must get CAPTools to provide information about how the serial program was transformed into its parallel
form. This will permit us to determine places in the code where there should be consistency between states in the
sequential and parallel versions.

Furthermore, while CAPTools allows for cyclic and block-cyclic array distributions, we currently support only blockwise 
distributions. In the future we will address this issue. 

Our approach for relative debugging of tool-parallelized distributed memory codes will also work for shared memory 
codes parallelized with tool support. In the near future we will extend our prototype to work with codes produced by CAPO 
[5] which is based on CAPTools and produces OpenMP [18] codes. 

In the longer term, we would like to use intraprocedural dataflow information from CAPTools in order to pinpoint
execution differences to particular statements, rather than procedure bodies. We recently began working with Steve
Johnson, of the University of Greenwich, on an implementation that uses a combination of state examination and
re-execution to backtrack detected differences to the first place where they occur. In this approach instrumentation is
inserted at USE points of variables to compare values across programs. When a difference is detected, the debugger uses
information from CAPTools to find the possible definition points of the variable. If the values used on the right-hand
sides of the definition points are still live, the debugger checks for differences in them. If a difference is detected,
the process is repeated looking at the definition points of the new difference variable. If the values of the
right-hand-side variables have been overwritten by subsequent execution, then these USEs are instrumented and the
program is rerun. Viewed another way, we are essentially using program slice information [14] to help minimize the
number of instrumentation points, and we are using liveness information to help minimize the number of re-executions.
Our expectation is that this approach will automatically isolate bugs with great precision.
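
The backtracking procedure described above can be sketched as a worklist algorithm. This is our reading of the planned approach, not an existing implementation. The mapping rhs_of (from a variable to the variables used on the right-hand sides of its possible definitions) stands in for the information that would come from CAPTools, and the predicates is_live and differs stand in for debugger queries. All names are hypothetical.

```python
def backtrack(start, rhs_of, is_live, differs):
    """Trace a detected difference back toward its first occurrence.

    Returns (origins, to_instrument): origins are differing variables
    whose right-hand-side inputs are all live and all agree, so the
    difference appears to arise in their definitions; to_instrument
    holds inputs whose values were overwritten and must be checked at
    their USE points in a re-execution.
    """
    origins = []
    to_instrument = set()
    visited = set()
    worklist = [start]
    while worklist:
        variable = worklist.pop()
        if variable in visited:
            continue
        visited.add(variable)
        inputs_clean = True
        for rhs in rhs_of.get(variable, ()):
            if not is_live(rhs):
                to_instrument.add(rhs)      # needs a rerun to inspect
                inputs_clean = False
            elif differs(rhs):
                worklist.append(rhs)        # follow the difference back
                inputs_clean = False
        if inputs_clean:
            origins.append(variable)
    return origins, to_instrument
```

For a chain in which c is defined from b and b from a, with a difference detected at c and also present in b but not in a, the sketch reports b as the origin. If b had been overwritten it would instead be returned for instrumentation in a rerun.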

Besides the relative debugging work, we would also like to experiment with other uses for dynamic instrumentation in
debugging. For example, we would like to use Dyninst to provide fast conditional breakpoints in p2d2.

10. Conclusions 

In this paper we have described a system that simplifies the process of debugging programs produced by computer-aided 
parallelization tools. The system uses relative debugging techniques to compare serial and parallel executions in order to 
show where the computations begin to differ. It uses information produced by the parallelization tool to drive the compari- 
son process without user intervention. In addition, the use of dynamic instrumentation makes the comparisons efficient. We 
feel that this approach holds great promise for meeting the goal of providing automated support for isolating bugs intro- 
duced in the parallelization process. 

FIGURE 7. Small amount of data, small number of computations, few comparisons.

Case 2: NLEN=32, NCOUNT=600, NCOMPUTE=1
FIGURE 8. Small amount of data, small number of computations, many comparisons.

Case 3: NLEN=1024, NCOUNT=10, NCOMPUTE=1
(Bar chart not reproduced; series: Dyninst and gdb; horizontal axis: number of processes.)
FIGURE 9. Large amount of data, small number of computations, few comparisons.

Case 4: NLEN=32, NCOUNT=500, NCOMPUTE=32
FIGURE 10. Small amount of data, medium number of computations, many comparisons.

FIGURE 11. Large amount of data, large number of computations, few comparisons.